Abstract

The Replicated Softmax model, a well-known undirected topic model, is powerful in extracting semantic representations of documents. Traditional learning strategies such as Contrastive Divergence are very inefficient. This paper provides a novel estimator to speed up the learning based on Noise Contrastive Estimate, extended for documents of variable lengths and weighted inputs. Experiments on two benchmarks show that the new estimator achieves great learning efficiency and high accuracy on document retrieval and classification.

1 Introduction 

Topic models are powerful probabilistic graphical approaches to analyze document semantics in different applications such as document categorization and information retrieval. They are mainly constructed by directed structure like pLSA (Hofmann, 2000) and LDA (Blei et al., 2003). Accompanied by the vast developments in deep learning, several undirected topic models, such as (Salakhutdinov and Hinton, 2009; Srivastava et al., 2013), have recently been reported to achieve great improvements in efficiency and accuracy.

The Replicated Softmax model (RSM) (Hinton and Salakhutdinov, 2009), a typical undirected topic model, is composed of a family of Restricted Boltzmann Machines (RBMs). Commonly, RSM is learned like standard RBMs using approximate methods like Contrastive Divergence (CD). However, CD is not really designed for RSM. Different from RBMs with binary input, RSM adopts softmax units to represent words, resulting in great inefficiency with sampling inside CD, especially for a large vocabulary. Yet, NLP systems usually require vocabulary sizes of tens to hundreds of thousands, which seriously limits the application of RSM.


Dealing with the large vocabulary size of the inputs is a serious problem in deep-learning-based NLP systems. Bengio et al. (2003) pointed this problem out when normalizing the softmax probability in the neural language model (NNLM), and Morin and Bengio (2005) solved it based on a hierarchical binary tree. A similar architecture was used in word representations like (Mnih and Hinton, 2009; Mikolov et al., 2013a). Directed tree structures cannot be applied to undirected models like RSM, but stochastic approaches can work well. For instance, Dahl et al. (2012) found that several Metropolis-Hastings sampling (MH) approaches approximate the softmax distribution in CD well, although MH requires additional complexity in computation. Hyvärinen (2007) proposed Ratio Matching (RM) to train unnormalized models, and Dauphin and Bengio (2013) added stochastic approaches in RM to accommodate high-dimensional inputs. Recently, a new estimator, Noise Contrastive Estimate (NCE) (Gutmann and Hyvärinen, 2010), was proposed for unnormalized models, and has shown great efficiency in learning word representations such as in (Mnih and Teh, 2012; Mikolov et al., 2013b).

In this paper, we propose an efficient learning strategy for RSM named α-NCE, applying NCE as the basic estimator. Different from most related efforts that use NCE for predicting a single word, our method extends NCE to generate noise for documents of variable lengths. It also enables RSM to use weighted inputs to improve the modelling ability. As RSM is usually used as the first layer in many deeper undirected models like Deep Boltzmann Machines (Srivastava et al., 2013), α-NCE can be readily extended to learn them efficiently.

2 Replicated Softmax Model

RSM is a typical undirected topic model, which is based on bag-of-words (BoW) to represent documents. In general, it consists of a series of RBMs, each of which contains a different number of softmax visible units but the same binary hidden units.

Suppose $K$ is the vocabulary size. For a document with $D$ words, if the $i^{th}$ word in the document equals the $k^{th}$ word of the dictionary, a vector $v_i \in \{0,1\}^K$ is assigned, only with the element $v_{ik} = 1$. An RBM is formed by assigning a hidden state $h \in \{0,1\}^H$ to this document $V = \{v_1, \ldots, v_D\}$, where the energy function is:

    $E_\theta(V, h) = -h^T W v - b^T v - D \cdot a^T h$    (1)

where $\theta = \{W, b, a\}$ are parameters shared by all the RBMs, and $v = \sum_{i=1}^{D} v_i$ is commonly referred to as the word count vector of a document. The probability for the document $V$ is given by:

    $P_\theta(V) = \frac{1}{Z_D} \sum_h e^{-E_\theta(V,h)} = \frac{1}{Z_D} e^{-F_\theta(V)}$    (2)

where $F_\theta(V)$ is the “free energy”, which can be analytically integrated easily, and $Z_D$ is the “partition function” for normalization, only associated with the document length $D$. As the hidden units are conditionally independent given the document, and the words are conditionally independent given the hidden state, the conditional distributions are derived:

    $P_\theta(v_{ik} = 1 | h) = \frac{\exp(W_k^T h + b_k)}{\sum_{k=1}^{K} \exp(W_k^T h + b_k)}$    (3)

    $P_\theta(h_j = 1 | V) = \sigma(W_j v + D \cdot a_j)$    (4)

where $\sigma(x) = \frac{1}{1 + e^{-x}}$. Equation (3) is the softmax units describing the multinomial distribution of the words, and Equation (4) serves as an efficient inference from words to semantic meanings, where we adopt the probabilities of each hidden unit “activated” as the topic features.

2.1 Learning Strategies for RSM

RSM is naturally learned by minimizing the negative log-likelihood function (ML) as follows:

    $L(\theta) = -E_{V \sim P_{data}} [\log P_\theta(V)]$    (5)

However, the gradient is intractable for the combinatorial normalization term $Z_D$. Common strategies to overcome this intractability are MCMC-based approaches such as Contrastive Divergence (CD) (Hinton, 2002) and Persistent CD (PCD) (Tieleman, 2008), both of which require repeating Gibbs steps of $h^{(i)} \sim P_\theta(h | V^{(i)})$ and $V^{(i+1)} \sim P_\theta(V | h^{(i)})$ to generate model samples to approximate the gradient. Typically, the performance and consistency improve when more steps are adopted. Notwithstanding, even one Gibbs step is time consuming for RSM, since the multinomial sampling normally requires linear time computations. The “alias method” (Kronmal and Peterson Jr, 1979) speeds up multinomial sampling to constant time while linear time is required for processing the distribution. Since $P_\theta(V | h)$ changes at every iteration in CD, such methods cannot be used.

3 Efficient Learning for RSM

Unlike (Dahl et al., 2012) that retains CD, we adopted NCE as the basic learning strategy. Considering RSM is designed for documents, we further modified NCE with two novel heuristics, developing the approach “Partial Noise Uniform Contrastive Estimate” (or α-NCE for short).
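The estimator only needs a handful of model quantities from Section 2, which can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes $W$ is stored as a list of $H$ rows of length $K$, and all function names are hypothetical. The closed form used for the free energy, $F_\theta(V) = -b^T v - \sum_j \log(1 + \exp(W_j v + D \cdot a_j))$, is obtained by summing the binary hidden state out of Equation (1); it is not written out in the text above.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x):
    # Logistic function 1 / (1 + exp(-x)), stable for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def count_vector(word_ids, K):
    # Word count vector v = sum_i v_i for a document given as word indices.
    v = [0.0] * K
    for k in word_ids:
        v[k] += 1.0
    return v


def hidden_inputs(W, a, v, D):
    # W_j v + D * a_j for every hidden unit j.
    return [sum(w * c for w, c in zip(W_j, v)) + D * a_j
            for W_j, a_j in zip(W, a)]


def hidden_posterior(W, a, v, D):
    # Equation (4): P(h_j = 1 | V), used as topic features.
    return [sigmoid(x) for x in hidden_inputs(W, a, v, D)]


def word_distribution(W, b, h):
    # Equation (3): softmax over the K dictionary words given h.
    scores = [sum(W[j][k] * h[j] for j in range(len(h))) + b[k]
              for k in range(len(b))]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def energy(W, b, a, v, D, h):
    # Equation (1): E(V, h) = -h^T W v - b^T v - D * a^T h.
    hWv = sum(h_j * sum(w * c for w, c in zip(W_j, v))
              for h_j, W_j in zip(h, W))
    bv = sum(b_k * c for b_k, c in zip(b, v))
    ah = sum(a_j * h_j for a_j, h_j in zip(a, h))
    return -hWv - bv - D * ah


def free_energy(W, b, a, v, D):
    # F(V) = -b^T v - sum_j softplus(W_j v + D * a_j), i.e. h summed out.
    bv = sum(b_k * c for b_k, c in zip(b, v))
    return -bv - sum(softplus(x) for x in hidden_inputs(W, a, v, D))
```

Only the unnormalized quantity $-F_\theta(V)$ is cheap to compute; the partition function $Z_D$ is what the estimators discussed in this section avoid evaluating.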

3.1 Noise Contrastive Estimate 

Noise Contrastive Estimate (NCE), similar to CD, is another estimator for training models with intractable partition functions. NCE solves the intractability through treating the partition function $Z_D$ as an additional parameter $Z_D^c$ added to $\theta$, which makes the likelihood computable. Yet, the model cannot be trained through ML as the likelihood tends to be arbitrarily large by setting $Z_D^c$ to huge numbers. Instead, NCE learns the model in a proxy classification problem with noise samples.

Given a document collection (data) $\{V_d\}_{T_d}$ and another collection (noise) $\{V_n\}_{T_n}$ with $T_n = kT_d$, NCE distinguishes these $(1+k)T_d$ documents simply based on Bayes' Theorem, where we assumed data samples matched by our model, indicating $P_\theta \approx P_{data}$, and noise samples generated from an artificial distribution $P_n$. Parameters are learned by minimizing the cross-entropy function:

    $J(\theta) = -E_{V_d \sim P_\theta} [\log \sigma_k(X(V_d))] - k E_{V_n \sim P_n} [\log \sigma_{k^{-1}}(-X(V_n))]$    (6)

and the gradient is derived as follows,

    $-\nabla_\theta J(\theta) = E_{V_d \sim P_\theta} [\sigma_{k^{-1}}(-X) \nabla_\theta X(V_d)] - k E_{V_n \sim P_n} [\sigma_k(X) \nabla_\theta X(V_n)]$    (7)

where $\sigma_k(x) = \frac{1}{1 + k e^{-x}}$, and the “log-ratio” is:

    $X(V) = \log [P_\theta(V) / P_n(V)]$    (8)

$J(\theta)$ can be optimized efficiently with stochastic gradient descent (SGD). Gutmann and Hyvärinen (2010) showed that the NCE gradient $\nabla_\theta J(\theta)$ will reach the ML gradient when $k \to \infty$. In practice, a larger $k$ tends to train the model better.
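To make the classification view concrete, the sketch below evaluates the NCE objective and the per-sample weights that multiply $\nabla_\theta X$ in its gradient, given precomputed log-ratios $X(V)$. It is a minimal illustration under stated assumptions: $\sigma_k(x) = 1/(1 + k e^{-x})$ is taken as the posterior probability that a sample is data, expectations are replaced by averages over $T_d$ data documents and $kT_d$ noise documents, and the function names are hypothetical.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_post_data(x, k):
    # log sigma_k(x), with sigma_k(x) = 1 / (1 + k * exp(-x)).
    return -softplus(math.log(k) - x)


def log_post_noise(x, k):
    # log sigma_{1/k}(-x) = log(1 - sigma_k(x)).
    return -softplus(x - math.log(k))


def nce_loss(x_data, x_noise, k):
    # x_data: log-ratios X(V_d) of T_d data documents.
    # x_noise: log-ratios X(V_n) of k * T_d noise documents.
    T_d = len(x_data)
    assert len(x_noise) == k * T_d
    data_term = sum(log_post_data(x, k) for x in x_data) / T_d
    # k times the average over k * T_d noise samples is the sum divided by T_d.
    noise_term = sum(log_post_noise(x, k) for x in x_noise) / T_d
    return -(data_term + noise_term)


def gradient_weights(x_data, x_noise, k):
    # Weights on grad X in the negative gradient of the loss:
    # +sigma_{1/k}(-X) for data samples, -sigma_k(X) for noise samples.
    w_data = [math.exp(log_post_noise(x, k)) for x in x_data]
    w_noise = [-math.exp(log_post_data(x, k)) for x in x_noise]
    return w_data, w_noise
```

Working in log space through `softplus` avoids overflow when the log-ratio of a long document becomes large in magnitude.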



3.2 Partial Noise Sampling 

Different from (Mnih and Teh, 2012), which generates noise per word, RSM requires the estimator to sample the noise at the document level. An intuitive approach is to sample from the empirical distribution $\tilde{p}$ for $D$ times, where the log probability is computed: $\log P_n(V) = \sum_{v \in V} [v^T \log \tilde{p}]$.

For a fixed $k$, Gutmann and Hyvärinen (2010) suggested choosing the noise close to the data for a sufficient learning result, indicating full noise might not be satisfactory. We proposed an alternative, “Partial Noise Sampling (PNS)”, to generate noise by replacing part of the data with sampled words. See Algorithm 1, where we fixed the proportion of remaining words at $\alpha$, named the “noise level” of PNS. However, traversing all the conditions to guess the remaining words requires $O(D!)$ computations. To avoid this, we simply bound the remaining words with the data and noise in advance, and the noise $\log P_n(V)$ is derived readily:

    $\log P_n(V) = \log P_\theta(V_r) + \sum_{v \in V \setminus V_r} [v^T \log \tilde{p}]$    (9)

where the remaining words $V_r$ are still assumed to be described by RSM with a smaller document length. In this way, it also strengthens the robustness of RSM towards incomplete data.

Algorithm 1 Partial Noise Sampling
 1: Initialize: $k$, $\alpha \in (0, 1)$
 2: for each $V_d = \{v\}_D \in \{V_d\}_{T_d}$ do
 3:     Set: $D_r = \lceil \alpha \cdot D \rceil$
 4:     Draw: $V_r = \{v_r\}_{D_r} \subseteq V_d$ uniformly
 5:     for $j = 1, \ldots, k$ do
 6:         Draw: $V_n^{(j)} = \{v_n^{(j)}\}_{D - D_r} \sim \tilde{p}$
 7:         $V_n^{(j)} = V_n^{(j)} \cup V_r$
 8:     end for
 9:     Bind: $(V_d, V_r), (V_n^{(1)}, V_r), \ldots, (V_n^{(k)}, V_r)$
10: end for
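The procedure of Partial Noise Sampling can be sketched for a single document as follows. This is an illustrative reading of the algorithm rather than the authors' code: documents are assumed to be lists of word indices, the number of replaced words is taken to be $D - D_r$ so that every noise document keeps the original length, and the names `partial_noise_samples` and `sampled_log_prob` are hypothetical.

```python
import math
import random


def partial_noise_samples(doc, p_tilde, k, alpha, rng):
    # doc: list of word indices; p_tilde: empirical word distribution.
    D = len(doc)
    D_r = math.ceil(alpha * D)  # number of remaining (kept) words
    # Choose the kept positions uniformly without replacement.
    keep_idx = sorted(rng.sample(range(D), D_r))
    kept = [doc[i] for i in keep_idx]
    vocab = range(len(p_tilde))
    noises = []
    for _ in range(k):
        # Replace the other D - D_r words by draws from p_tilde.
        sampled = rng.choices(vocab, weights=p_tilde, k=D - D_r)
        noises.append(kept + sampled)
    return kept, noises


def sampled_log_prob(sampled_words, p_tilde):
    # Empirical-distribution part of the noise log-probability for the
    # replaced words; the model term for the kept words is added separately.
    return sum(math.log(p_tilde[w]) for w in sampled_words)
```

For example, `partial_noise_samples([0, 3, 3, 1, 2, 0], [0.4, 0.3, 0.2, 0.1], k=3, alpha=0.5, rng=random.Random(0))` keeps three words of the document and returns three noise documents of length six that all share those kept words.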

Sampling the noise normally requires additional computational load. Fortunately, since $\tilde{p}$ is fixed, sampling is efficient using the “alias method”. It also allows storing the noise for subsequent use, yielding much faster computation than CD.
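For readers unfamiliar with the alias method, the sketch below shows one common formulation of it (often attributed to Walker and Vose), assuming a fixed distribution given as a list of probabilities. It is not taken from the paper; the function names are hypothetical. Building the table takes time linear in the vocabulary size, after which each draw costs constant time.

```python
import random


def build_alias_table(p):
    # Preprocess a fixed distribution p (probabilities summing to 1).
    K = len(p)
    scaled = [pi * K for pi in p]
    prob = [0.0] * K
    alias = [0] * K
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]  # keep column s with this probability
        alias[s] = l         # otherwise fall through to column l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    # Leftover columns are full up to rounding error.
    for i in large + small:
        prob[i] = 1.0
        alias[i] = i
    return prob, alias


def alias_draw(prob, alias, rng):
    # One sample in constant time: pick a column, then flip a biased coin.
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

Because the table depends only on the distribution, it pays off exactly when the distribution stays fixed across draws, which holds for the empirical word distribution but not for the model conditional inside CD.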

3.3 Uniform Contrastive Estimate 

When we initially implemented NCE for RSM, we found the document lengths terribly biased the log-ratio, resulting in bad parameters. Therefore “Uniform Contrastive Estimate (UCE)” was proposed to accommodate variable document lengths by adding the uniform assumption:

    $X(V) = D^{-1} \log [P_\theta(V) / P_n(V)]$    (10)

where UCE adopts the uniform probabilities $\sqrt[D]{P_\theta}$ and $\sqrt[D]{P_n}$ for classification to average the modelling ability at word-level. Note that $D$ is not necessarily an integer in UCE, which allows choosing real-valued weights on the document such as idf-weighting (Salton and McGill, 1983). Typically, it is defined as a weighting vector $w$, where $w_k = \log \frac{T_d}{|\{V \in \{V_d\} : v_{ik} = 1, v_i \in V\}|}$ is multiplied to the $k^{th}$ word in the dictionary. Thus for a weighted input $V^w$ and corresponding length $D^w$, we derive:

    $X(V^w) = (D^w)^{-1} \log [P_\theta(V^w) / P_n(V^w)]$    (11)

where $\log P_n(V^w) = \sum_{v^w \in V^w} [v^{wT} \log \tilde{p}]$. A specific $Z_{D^w}^c$ will be assigned to $P_\theta(V^w)$.

Combining PNS and UCE yields a new estimator for RSM, which we simply call α-NCE¹.
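A small sketch may help to see how the length normalization and the idf weights fit together. The code below is a hedged illustration only: it assumes documents are lists of word indices, uses the standard idf definition (log of the number of documents over the document frequency), and takes the weighted length to be the sum of the weighted counts, which is an assumption rather than something the text states. All function names are hypothetical.

```python
import math


def idf_weights(docs, K):
    # w_k = log(T / df_k), where df_k counts documents containing word k.
    T = len(docs)
    df = [0] * K
    for doc in docs:
        for k in set(doc):
            df[k] += 1
    # Words never seen get weight 0 to avoid division by zero.
    return [math.log(T / d) if d > 0 else 0.0 for d in df]


def weighted_counts(doc, w):
    # Weighted count vector: every occurrence of word k contributes w_k.
    v = [0.0] * len(w)
    for k in doc:
        v[k] += w[k]
    return v


def weighted_length(v_weighted):
    # Assumed definition of the real-valued length: sum of weighted counts.
    return sum(v_weighted)


def uce_log_ratio(log_p_model, log_p_noise, length):
    # Length-normalized log-ratio: (log P_model - log P_noise) / length.
    return (log_p_model - log_p_noise) / length
```

Dividing by the length puts short and long documents on a per-word scale, so a long document no longer dominates the classification simply because its log-probabilities are larger in magnitude.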

4 Experiments 

4.1 Datasets and Details of Learning 

We evaluated the new estimator to train RSMs on 
two text datasets: 20 Newsgroups and IMDB. 

The 20 Newsgroups² dataset is a collection of Usenet posts, which contains 11,345 training and 7,531 testing instances. Both the training and testing sets are labeled into 20 classes. Removing stop words as well as stemming were performed.

The IMDB dataset³ is a benchmark for sentiment analysis, which consists of 100,000 movie reviews taken from IMDB. The dataset is divided into 75,000 training instances (1/3 labeled and 2/3 unlabeled) and 25,000 testing instances. Two types of labels, positive and negative, are given to show sentiment. Following (Maas et al., 2011), no stop words are removed from this dataset.

For each dataset, we randomly selected 10% of the training set for validation, and the idf-weight vector is computed in advance. In addition, replacing the word count $v$ by [log(1 + $v$)] slightly improved the modelling performance for all models.

We implemented α-NCE according to the parameter settings in (Hinton, 2010) using SGD in minibatches of size 128 and an initial learning rate of 0.1. The number of hidden units was fixed at 128 for all models. Although learning the partition function parameter $Z_D^c$ separately for every length $D$ is nearly impossible, as in (Mnih and Teh, 2012) we also surprisingly found that freezing $Z_D^c$ as a constant function of $D$ without updating never harmed but actually enhanced the performance. It is probably because the large number of free parameters in RSM are forced to learn better when $Z_D^c$ is a constant. In practice, we set this constant function as $Z_D^c = 2^{[\ldots]} \cdot [\ldots]$. It can readily extend to learn RSM for the real-valued weighted length $D^w$.

¹ α comes from the noise level in PNS, but UCE is also a vital part of this estimator, which is absorbed in α-NCE.

² Available at http://qwone.com/~jason/20Newsgroups

³ Available at http://ai.stanford.edu/~amaas/data/sentiment

We also implemented CD with the same settings. All the experiments were run on a single GPU GTX970 using the library Theano (Bergstra et al., 2010). To make the comparison fair, both α-NCE and CD share the same implementation.

4.2 Evaluation of Efficiency 

To evaluate the efficiency in learning, we used the most frequent words as dictionaries with sizes ranging from 100 to 20,000 for both datasets, and tested the computation time both for CD with varying numbers of Gibbs steps and for α-NCE with varying noise sample sizes. The comparison of the mean running time per minibatch, averaged over both datasets, is shown in Figure 1. Typically, α-NCE achieves a 10 to 500 times speed-up compared to CD. Although both CD and α-NCE run slower when the input dimension increases, CD tends to take much more time due to the multinomial sampling at each iteration, especially when more Gibbs steps are used. In contrast, running time stays reasonable in α-NCE even if a larger noise size or a larger dimension is applied.

Figure 1: Comparison of running time

4.3 Evaluation of Performance 

One direct measure to evaluate the modelling performance is to assess RSM as a generative model to estimate the log-probability per word as perplexity. However, as α-NCE learns RSM by distinguishing the data and noise from their respective features, parameters are trained more like a feature extractor than a generative model. It is not fair to use perplexity to evaluate the performance. For this reason, we evaluated the modelling performance with some indirect measures.

Figure 2: Precision-Recall curves for the retrieval task on the 20 Newsgroups dataset using RSMs.

For 20 Newsgroups, we trained RSMs on the training set, and reported the results on document retrieval and document classification. For retrieval, we treated the testing set as queries, and retrieved documents with the same labels in the training set by cosine-similarity. Precision-recall (P-R) curves and mean average precision (MAP) are two metrics we used for evaluation. For classification, we trained a softmax regression on the training set, and checked the accuracy on the testing set. We use this dataset to show the modelling ability of RSM with different estimators.
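The retrieval protocol can be illustrated with a short sketch. It assumes topic features are plain lists of floats and uses the common definition of average precision (the mean of the precision values at the ranks of the relevant documents); the text does not spell out which variant was used, so this choice and the function names are assumptions.

```python
import math


def cosine(u, v):
    # Cosine similarity between two feature vectors (0 if either is zero).
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def average_precision(ranked_relevance):
    # ranked_relevance: booleans in ranked order, True for a same-label hit.
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(query_feats, query_labels, train_feats, train_labels):
    # Each test document is a query; training documents are ranked by cosine.
    aps = []
    for q, q_label in zip(query_feats, query_labels):
        order = sorted(range(len(train_feats)),
                       key=lambda i: cosine(q, train_feats[i]),
                       reverse=True)
        aps.append(average_precision([train_labels[i] == q_label for i in order]))
    return sum(aps) / len(aps)
```

In this setting the feature vectors would be the hidden-unit posteriors of the trained RSM, and the labels the newsgroup classes.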

For IMDB, the whole training set is used for learning RSMs, and an L2-regularized logistic regression is trained on the labeled training set. The error rate of sentiment classification on the testing set is reported, compared with several BoW-based baselines. We use this dataset to show the general modelling ability of RSM compared with others.

We trained both α-NCE and CD, as well as plain NCE (without UCE), at a fixed vocabulary size (2000 for 20 Newsgroups, and 5000 for IMDB). Posteriors of the hidden units were used as topic features. For α-NCE, we fixed the noise level at 0.5 for 20 Newsgroups and 0.3 for IMDB. In comparison, we trained CD from 1 up to 5 Gibbs steps.

Figure 2 and Table 1 show that a larger noise size in α-NCE achieves better modelling performance, and α-NCE greatly outperforms CD on retrieval tasks, especially around large recall values. The classification results of α-NCE are also comparable to or slightly better than those of CD. Simultaneously, it is gratifying to find that the idf-weighting inputs achieve the best results both in retrieval and classification tasks, as idf-weighting is known to extract information better than word count. In addition, plain NCE (without UCE) performs poorly compared to the others in Figure 2, indicating that variable document lengths actually bias the learning greatly.

Figure 3: Tracking the modelling performance with varying α using α-NCE to learn RSMs. CD is also reported as the baseline. Panels: (a) MAP for document retrieval, (b) document classification accuracy, (c) sentiment classification accuracy. (a) and (b) are performed on 20 Newsgroups, and (c) is performed on IMDB.


  Estimator   CD       α-NCE    α-NCE    α-NCE    α-NCE
  Setting     -        k=1      k=5      k=25     k=25 (idf)
  Accuracy    64.1%    61.8%    63.6%    64.8%    65.6%

Table 1: Comparison of classification accuracy on the 20 Newsgroups dataset using RSMs.


  Models                                      Accuracy
  Bag of Words (BoW) (Maas and Ng, 2010)      86.75%
  LDA (Maas et al., 2011)                     67.42%
  LSA (Maas et al., 2011)                     83.96%
  Maas et al. (2011)'s “full” model           87.44%
  WRRBM (Dahl et al., 2012)                   87.42%
  RSM:CD                                      86.22%
  RSM:α-NCE-5                                 87.09%
  RSM:α-NCE-5 (idf)                           87.81%

Table 2: The performance of sentiment classification accuracy on the IMDB dataset using RSMs compared to other BoW-based approaches.

On the other hand, Table 2 shows the performance of RSM in sentiment classification, where model combinations reported in previous efforts are not considered. It is clear that α-NCE learns RSM better than CD, and outperforms BoW and other BoW-based models⁴ such as LDA. The idf-weighting inputs also achieve the best performance. Note that RSM is also based on BoW, indicating α-NCE has arguably reached the limits of learning BoW-based models. In future work, RSM can be extended to more powerful undirected topic models, by considering more syntactic information such as word-order or dependency relationship in representation. α-NCE can be used to learn them efficiently and achieve better performance.

⁴ Strictly speaking, WRRBM uses a “bag of n-grams” assumption.

4.4 Choice of Noise Level α

In order to decide the best noise level ($\alpha$) for PNS, we learned RSMs using α-NCE with different noise levels for both word count and idf-weighting inputs on the two datasets. Figure 3 shows that α-NCE learning with partial noise ($\alpha > 0$) outperforms full noise ($\alpha = 0$) in most situations, and achieves better results than CD in retrieval and classification on both datasets. However, learning tends to become extremely difficult if the noise becomes too close to the data, and this explains why the performance drops rapidly when $\alpha \to 1$. Furthermore, curves in Figure 3 also imply the choice of $\alpha$ might be problem-dependent, with larger sets like IMDB requiring relatively smaller $\alpha$. Nonetheless, a systematic strategy for choosing optimal $\alpha$ will be explored in future work. In practice, a range from 0.3 to 0.5 is recommended.

5 Conclusions 

We propose a novel approach, α-NCE, for learning undirected topic models such as RSM efficiently, allowing large vocabulary sizes. It is a new estimator based on NCE, adapted to documents with variable lengths and weighted inputs. We learn RSMs with α-NCE on two classic benchmarks, where it achieves both efficiency in learning and accuracy in retrieval and classification tasks.